Compressed instruction format for use in a VLIW processor

ABSTRACT

A compressed instruction format for a VLIW processor allows greater efficiency in use of cache and memory. Instructions are byte aligned and variable length. Branch targets are uncompressed. Format bits specify how many issue slots are used in a following instruction. NOPS are not stored in memory. Individual operations are compressed according to features such as whether they are resultless, guarded, short, zeroary, unary, or binary. Instructions are stored in compressed form in memory and in cache. Instructions are decompressed on the fly after being read out from cache.

1. BACKGROUND OF THE INVENTION

[0001] 1. Field of the Invention

[0002] The invention relates to VLIW (Very Long Instruction Word) processors and in particular to instruction formats for such processors and apparatus for processing such instruction formats.

[0003] 2. Background of the Invention

[0004] VLIW processors have instruction words including a plurality of issue slots. The processors also include a plurality of functional units. Each functional unit is for executing a set of operations of a given type. Each functional unit is RISC-like in that it can begin an instruction in each machine cycle in a pipelined manner. Each issue slot is for holding a respective operation. All of the operations in a same instruction word are to be begun in parallel on the functional units in a single cycle of the processor. Thus the VLIW implements fine-grained parallelism.

[0005] Thus, typically an instruction on a VLIW machine includes a plurality of operations. On conventional machines, each operation might be referred to as a separate instruction. However, in the VLIW machine, each instruction is composed of operations or no-ops (dummy operations).

[0006] Like conventional processors, VLIW processors use a memory device, such as a disk drive, to store instruction streams for execution on the processor. A VLIW processor can also use caches, like conventional processors, to store pieces of the instruction streams with high bandwidth accessibility to the processor.

[0007] The instruction in the VLIW machine is built up by a programmer or compiler out of these operations. Thus the scheduling in the VLIW processor is software-controlled.

[0008] The VLIW processor can be compared with other types of parallel processors such as vector processors and superscalar processors as follows. Vector processors have single operations which are performed on multiple data items simultaneously. Superscalar processors implement fine-grained parallelism, like the VLIW processors, but unlike the VLIW processor, the superscalar processor schedules operations in hardware.

[0009] Because of the long instruction words, the VLIW processor has aggravated problems with cache use. In particular, large code size causes cache misses, i.e. situations where needed instructions are not in cache. Large code size also requires a higher main memory bandwidth to transfer code from the main memory to the cache.

[0010] Large code size can be aggravated by the following factors.

[0011] In order to fine tune programs for optimal running, techniques such as grafting, loop unrolling, and procedure inlining are used. These procedures increase code size.

[0012] Not all issue slots are used in each instruction. A good optimizing compiler can reduce the number of unused issue slots; however a certain number of no-ops (dummy instructions) will continue to be present in most instruction streams.

[0013] In order to optimize use of the functional units, operations on conditional branches are typically begun prior to expiration of the branch delay, i.e. before it is known which branch is going to be taken. To resolve which results are actually to be used, guard bits are included with the instructions.

[0014] Larger register files, preferably used on newer processor types, require longer addresses, which have to be included with operations.

[0015] A scheme for compression of VLIW instructions has been proposed in U.S. Pat. Nos. 5,179,680 and 5,057,837. This compression scheme eliminates unused operations in an instruction word using a mask word, but there is more room to compress the instruction.

2. SUMMARY OF THE INVENTION

[0016] It is an object of the invention to reduce code size in a VLIW processor.

[0017] This object is met by using a compression scheme in which, within an instruction having a plurality of operations, each operation is compressed. Compression includes assigning a compressed operation length to the operation. The compression includes choosing one of a plurality of finite lengths. The finite lengths include at least one non-zero length. Which length is chosen depends on a feature of the operation.

[0018] Branch targets are not compressed. For each instruction, information about compression format is stored in a previous instruction.

3. Further Information About Technical Background to this Application

[0019] The following prior applications are incorporated herein by reference:

[0020] U.S. application Ser. No. 998,090, filed Dec. 29, 1992 (PHA 21,777), which shows a VLIW processor architecture for implementing fine-grained parallelism;

[0021] U.S. application Ser. No. 142,648, filed Oct. 25, 1993 (PHA 1205), which shows use of guard bits; and

[0022] U.S. application Ser. No. 366,958, filed Dec. 30, 1994 (PHA 21,932), which shows a register file for use with VLIW architecture.

[0023] Bibliography of program compression techniques:

[0024] J. Wang et al., “The Feasibility of Using Compression to Increase Memory System Performance”, Proc. 2nd Int. Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, pp. 107-113 (Durham, N.C., USA 1994);

[0025] H. Schröder et al., “Program compression on the instruction systolic array”, Parallel Computing, vol. 17, no. 2-3, Jun. 1991, pp. 207-219;

[0026] A. Wolfe et al., “Executing Compressed Programs on an Embedded RISC Architecture”, J. Computer and Software Engineering, vol. 2, no. 3, pp. 315-27 (1994);

[0027] M. Kozuch et al., “Compression of Embedded Systems Programs”, Proc. 1994 IEEE Int. Conf. on Computer Design: VLSI in Computers and Processors (Oct. 10-12, 1994, Cambridge, Mass., USA), pp. 270-7.

[0028] Typically the approach adopted in these documents has been to attempt to compress a program as a whole or blocks of program code. Moreover, typically some table of instruction locations or locations of blocks of instructions is necessitated by these approaches.

4. BRIEF DESCRIPTION OF THE DRAWING

[0029] The invention will now be described by way of non-limitative example with reference to the following figures:

[0030] FIG. 1a shows a processor for using the compressed instruction format of the invention.

[0031] FIG. 1b shows more detail of the CPU of the processor of FIG. 1a.

[0032] FIGS. 2a-2e show possible positions of instructions in cache.

[0033] FIG. 3 illustrates a part of the compression scheme in accordance with the invention.

[0034] FIGS. 4a-4f illustrate examples of compressed instructions in accordance with the invention.

[0035] FIGS. 5a-5b give a table of compressed instruction formats according to the invention.

[0036] FIG. 6a is a schematic showing the functioning of instruction cache 103 on input.

[0037] FIG. 6b is a schematic showing the functioning of a portion of the instruction cache 103 on output.

[0038] FIG. 7 is a schematic showing the functioning of instruction cache 103 on output.

[0039] FIG. 8 illustrates compilation and linking of code according to the invention.

[0040] FIG. 9 is a flow chart of compression and shuffling modules.

[0041] FIG. 10 expands box 902 of FIG. 9.

[0042] FIG. 11 expands box 1005 of FIG. 10.

[0043] FIG. 12 illustrates the decompression process.

5. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0044] FIG. 1a shows the general structure of a processor according to the invention. A microprocessor according to the invention includes a CPU 102, an instruction cache 103, and a data cache 105. The CPU is connected to the caches by high bandwidth buses. The microprocessor also contains a memory 104 where an instruction stream is stored.

[0045] The cache 103 is structured to have 512-bit double words. The individual bytes in the words are addressable, but the bits are not. Bytes are 8 bits long. Preferably the double words are accessible as a single word in a single clock cycle.

[0046] The instruction stream is stored as instructions in a compressed format in accordance with the invention. The compressed format is used both in the memory 104 and in the cache 103.

[0047] FIG. 1b shows more detail of the VLIW processor according to the invention. The processor includes a multiport register file 150, a number of functional units 151, 152, 153, . . . , and an instruction issue register (IIR) 154. The multiport register file stores results from and operands for the functional units. The instruction issue register includes a plurality of issue slots for containing operations to be commenced in a single clock cycle, in parallel, on the functional units 151, 152, 153, . . . . A decompression unit 155, explained more fully below, converts the compressed instructions from the instruction cache 103 into a form usable by the IIR 154.

COMPRESSED INSTRUCTION FORMAT

[0048] 1. General Characteristics

[0049] The preferred embodiment of the claimed instruction format is optimized for use in a VLIW machine having an instruction word which contains 5 issue slots. The format has the following characteristics:

[0050] unaligned, variable length instructions;

[0051] variable number of operations per instruction;

[0052] 3 possible sizes of operations: 26, 34 or 42 bits (also called a 26/34/42 format);

[0053] the 32 most frequently used operations are encoded more compactly than the other operations;

[0054] operations can be guarded or unguarded;

[0055] operations are one of zeroary, unary, or binary, i.e. they have 0, 1 or 2 operands;

[0056] operations can be resultless;

[0057] operations can contain immediate parameters having 7 or 32 bits;

[0058] branch targets are not compressed; and

[0059] format bits for an instruction are located in the prior instruction.

[0060] 2. Instruction Alignment

[0061] Except for branch targets, instructions are stored aligned on byte boundaries in cache and main memory. Instructions are unaligned with respect to word or block boundaries in either cache or main memory. Unaligned instruction cache access is therefore needed. In order to retrieve unaligned instructions, the processor retrieves one word per clock cycle from the cache.

[0062] As will be seen from the compression format described below, branch targets need to be uncompressed and must fall within a single word of the cache, so that they can be retrieved in a single clock cycle. Branch targets are aligned by the compiler or programmer according to the following rule:

[0063] if a word boundary falls within the branch target or exactly at the end of the branch target, padding is added to make the branch target start at the next word boundary.

[0064] Because the preferred cache retrieves double words in a single clock cycle, the rule above can be modified to substitute double word boundaries for word boundaries.
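The alignment rule can be illustrated with a short sketch. It assumes byte addresses, a 32-byte (256-bit) cache word, and a branch target that is shorter than one word; the function name and its parameters are hypothetical and are not taken from the described embodiment.

```python
def branch_target_padding(address, length, word_bytes=32):
    """Return the number of padding bytes to insert before a branch target.

    address    -- byte address at which the branch target would start
    length     -- size of the uncompressed branch target in bytes
    word_bytes -- cache word size in bytes (assumed value: 32)
    """
    if not 0 < length < word_bytes:
        # The rule can only be satisfied if the target is shorter than a word.
        raise ValueError("branch target must be shorter than one cache word")
    next_boundary = (address // word_bytes + 1) * word_bytes
    # A boundary inside the target, or exactly at its end, forces realignment.
    if address + length >= next_boundary:
        return next_boundary - address
    return 0


if __name__ == "__main__":
    for addr in (0, 3, 4, 10):
        print(addr, branch_target_padding(addr, 28))
```

Under these assumptions, a 28-byte branch target starting at byte 4 ends exactly on the boundary at byte 32 and is therefore moved, as in FIGS. 2d and 2e. Passing word_bytes=64 gives the double word variant of the rule.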

[0065] The normal unaligned instructions are retrieved so that succeeding instructions are assembled from the tail portion of the current word and an initial portion of the succeeding word. Similarly, all subsequent instructions may be assembled from 2 cache words, retrieving an additional word in each clock cycle.

[0066] This means that whenever code segments are relocated (for instance in the linker or in the loader) alignment must be maintained. This can be achieved by relocating base addresses of the code segments to multiples of the cache block size.

[0067] FIGS. 2a-2e show unaligned instruction storage in cache in accordance with the invention.

[0068] FIG. 2a shows two cache words with three instructions i1, i2, and i3 in accordance with the invention. The instructions are unaligned with respect to word boundaries. Instructions i1 and i2 can be branch targets, because they fall entirely within a cache word. Instruction i3 crosses a word boundary and therefore must not be a branch target. For the purposes of these examples, however, it will be assumed that i1 and only i1 is a branch target.

[0069] FIG. 2b shows an impermissible situation. Branch target i1 crosses a word boundary. Accordingly, the compiler or programmer must shift the instruction i1 to a word boundary and fill the open area with padding bytes, as shown in FIG. 2c.

[0070] FIG. 2d shows another impermissible situation. Branch target instruction i1 ends precisely at a word boundary. In this situation, again i1 must be moved over to a word boundary and the open area filled with padding as shown in FIG. 2e.

[0071] Branch targets must be instructions, rather than operations within instructions. The instruction compression techniques described below generally eliminate no-ops (dummy instructions). However, because the branch target instructions are uncompressed, they must contain no-ops to fill the issue slots which are not to be used by the processor.

[0072] 3. Bit and Byte Order

[0073] Throughout this application bit and byte order are little endian. Bits and bytes are listed with the least significant bits first, as below:

Bit number     0 . . .    8 . . .    16 . . .
Byte number    0          1          2
Address        0          1          2

[0074] 4. Instruction Format

[0075] The compressed instruction can have up to seven types of fields.These are listed below. The format bits are the only mandatory field.

[0076] The instructions are composed of byte aligned sections. The first two bytes contain the format bits and the first group of 2-bit operation parts. All of the other fields are integral multiples of a byte, except for the second 2-bit operation parts which contain padding bits.

[0077] The operations, as explained above, can have 26, 34, or 42 bits. 26-bit operations are broken up into a 2-bit part to be stored with the format bits and a 24-bit part. 34-bit operations are broken up into a 2-bit part, a 24-bit part, and a one byte extension. 42-bit operations are broken up into a 2-bit part, a 24-bit part, and a two byte extension.

[0078] A. Format Bits

[0079] These are described in section 5 below. With a 5 issue slot machine, 10 format bits are needed. Thus, one byte plus two bits are used.

[0080] B. 2-Bit Operation Parts, First Group

[0081] While most of each operation is stored in the 24-bit part explained below, i.e. 3 bytes, with the preferred instruction set 24 bits was not adequate. The shortest operations required 26 bits. Accordingly, it was found that the six bits left over in the bytes for the format bit field could advantageously be used to store extra bits from the operations, two bits for each of three operations. If the six bits designated for the 2-bit parts are not needed, they can be filled with padding bits.

[0082] C. 24-Bit Operation Parts, First Group

[0083] There will be as many 24-bit operation parts as there were 2-bit operation parts in the first group of 2-bit operation parts. In other words, up to three 3-byte operation parts can be stored here.

[0084] D. 2-Bit Operation Parts, Second Group

[0085] In machines with more than 3 issue slots a second group of 2-bit and 24-bit operation parts is necessary. The second group of 2-bit parts consists of a byte with 4 sets of 2-bit parts. If any issue slot is unused, its bit positions are filled with padding bits. Padding bits sit on the left side of the byte. In a five issue slot machine, with all slots used, this section would contain 4 padding bits followed by two groups of 2-bit parts. The five issue slots are spread out over the two groups: 3 issue slots in the first group and 2 issue slots in the second group.

[0086] E. 24-Bit Operation Parts, Second Group

[0087] The group of 2-bit parts is followed by a corresponding group of 24-bit operation parts. In a five issue slot machine with all slots used, there would be two 24-bit parts in this group.

[0088] F. Further Groups of 2-Bit and 24-Bit Parts

[0089] In a very wide machine, i.e. more than 6 issue slots, further groups of 2-bit and 24-bit operation parts are necessary.

[0090] G. Operation Extension

[0091] At the end of the instruction there is a byte-aligned group of optional 8 or 16 bit operation extensions, each of them byte aligned. The extensions are used to extend the size of the operations from the basic 26 bits to 34 or 42 bits, if needed.

[0092] The formal specification for the instruction format is:

[0093] <instruction>::=

[0094] <instruction start>

[0095] <instruction middle>

[0096] <instruction end>

[0097] <instruction extension>

[0098] <instruction start>::=

[0099] <Format:2*N>{<padding:1>}V2{<2-bit operation part:2>}V1{<24-bit operation part:24>}V1

[0100] <instruction middle>::={{<2-bit operation part:2>}4{<24-bit operation part:24>}4}V3

[0101] <instruction end>::={<padding:1>}V5{<2-bit operation part:2>}V4{<24-bit operation part:24>}V4

[0102] <instruction extension>::={<operation extension:0/8/16>}S

[0103] <padding>::=“0”

[0104] Wherein the variables used above are defined as follows:

[0105] N=the number of issue slots of the machine, N>1

[0106] S=the number of issue slots used in this instruction (0≦S≦N)

[0107] C1=4−(N mod 4)

[0108] If (S≦C1) then V1=S and V2=2*(C1−V1)

[0109] If (S>C1) then V1=C1 and V2=0

[0110] V3=(S−V1) div 4

[0111] V4=(S−V1) mod 4

[0112] If (V4>0) then V5=2*(4−V4) else V5=0

[0113] Explanation of notation

[0114] ::=means “is defined as”

[0115] <field name:number>

[0116] means the field indicated before the colon has the number of bits indicated after the colon.

[0117] {<field name>}number

[0118] means the field indicated in the angle brackets and braces is repeated the number of times indicated after the braces

[0119] “0” means the bit “0”.

[0120] “div” means integer divide

[0121] “mod” means modulo

[0122] :0/8/16

[0123] means that the field is 0, 8, or 16 bits long
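The formal specification above can be checked with a small calculation. The following is a minimal sketch, not the implementation of the appendices: it evaluates C1 and V1 through V5 as defined above and adds up the field widths of the four instruction sections. The function names and the representation of an instruction as a list of per-operation extension sizes (0, 1 or 2 bytes) are assumptions made for illustration.

```python
def field_counts(n_slots, used):
    """Evaluate V1..V5 for an N-issue-slot machine with S used slots."""
    if n_slots <= 1 or not 0 <= used <= n_slots:
        raise ValueError("require N > 1 and 0 <= S <= N")
    c1 = 4 - (n_slots % 4)
    if used <= c1:
        v1, v2 = used, 2 * (c1 - used)
    else:
        v1, v2 = c1, 0
    v3 = (used - v1) // 4          # "div" in the specification
    v4 = (used - v1) % 4           # "mod" in the specification
    v5 = 2 * (4 - v4) if v4 > 0 else 0
    return v1, v2, v3, v4, v5


def instruction_bytes(n_slots, extension_bytes):
    """Length in bytes of one compressed instruction.

    extension_bytes holds one entry (0, 1 or 2) per operation present.
    """
    if any(e not in (0, 1, 2) for e in extension_bytes):
        raise ValueError("each extension is 0, 1 or 2 bytes")
    v1, v2, v3, v4, v5 = field_counts(n_slots, len(extension_bytes))
    start = 2 * n_slots + v2 + 2 * v1 + 24 * v1    # <instruction start>
    middle = v3 * (4 * 2 + 4 * 24)                 # <instruction middle>
    end = v5 + 2 * v4 + 24 * v4                    # <instruction end>
    bits = start + middle + end + 8 * sum(extension_bytes)
    assert bits % 8 == 0, "sections are byte aligned by construction"
    return bits // 8


if __name__ == "__main__":
    print(field_counts(5, 5))
    print(instruction_bytes(5, []))
    print(instruction_bytes(5, [2, 2, 2, 2, 2]))
```

For the five issue slot machine this gives 2 bytes for an instruction with no operations and 28 bytes for five 42-bit operations, which agrees with the sizes stated elsewhere in this description.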

[0124] Examples of compressed instructions are shown in FIGS. 4a-4f.

[0125] FIG. 4a shows an instruction with no operations. The instruction contains two bytes, including 10 bits for the format field and 6 bits which contain only padding. The former is present in all the instructions. The latter normally correspond to the 2-bit operation parts. The X's at the top of the bit field indicate that the fields contain padding. In the later figures, an O is used to indicate that the fields are used.

[0126] FIG. 4b shows an instruction with one 26-bit operation. The operation includes one 24-bit part at bytes 3-5 and one 2-bit part in byte 2. The 2 bits which are used are marked with an O at the top.

[0127] FIG. 4c shows an instruction with two 26-bit operations. The first 26-bit operation has its 24-bit part in bytes 3-5 and its extra two bits in the last of the 2-bit part fields. The second 26-bit operation has its 24-bit part in bytes 6-8 and its extra two bits in the second to last of the 2-bit part fields.

[0128] FIG. 4d shows an instruction with three 26-bit operations. The 24-bit parts are located in bytes 3-11 and the 2-bit parts are located in byte 2 in reversed order from the 24-bit parts.

[0129] FIG. 4e shows an instruction with four operations. The second operation has a 2 byte extension. The fourth operation has a one byte extension. The 24-bit parts of the operations are stored in bytes 3-11 and 13-15. The 2-bit parts of the first three operations are located in byte 2. The 2-bit part of the fourth operation is located in byte 12. An extension for operation 2 is located in bytes 16-17. An extension for operation 4 is located in byte 18.

[0130] FIG. 4f shows an instruction with 5 operations, each of which has a one byte extension. The extensions all appear at the end of the instruction.

[0131] While extensions only appear after the second group of 2-bit parts in the examples, they could equally well appear at the end of an instruction with 3 or fewer operations. In such a case the second group of 2-bit parts would not be needed.

[0132] There is no fixed relationship between the position of operations in the instruction and the issue slot in which they are issued.

[0133] This makes it possible to make an instruction shorter when not all issue slots are used. Operation positions are filled from left to right. The Format section of the instruction indicates to which issue slot a particular operation belongs. For instance, if an instruction contains only one operation, then it is located in the first operation position and it can be issued to any issue slot, not just slot number 1. The decompression hardware takes care of routing operations to their proper issue slots.

[0134] No padding bytes are allowed between instructions that form one sequential block of code. Padding bytes are allowed between distinct blocks of code.

[0135] 5. Format Bits

[0136] The instruction compression technique of the invention is characterized by the use of a format field which specifies which issue slots are to be used by the compressed instruction. To achieve retrieval efficiency, format bits are stored in the instruction preceding the instruction to which the format bits relate. This allows pipelining of instruction retrieval. The decompression unit is alerted to how many issue slots to expect in the instruction to follow prior to retrieval of that instruction. The storage of format bits preceding the operations to which they relate is illustrated in FIG. 3. Instruction 1, which is an uncompressed branch target, contains a format field which indicates the issue slots used by the operations specified in instruction 2. Instructions 2 through 4 are compressed. Each contains a format field which specifies issue slots to be used by the operations of the subsequent instruction.

[0137] The format bits are encoded as follows. There are 2*N format bits for an N-issue slot machine. In the case of the preferred embodiment, there are five issue slots. Accordingly, there are 10 format bits. Herein the format bits will be referred to in matrix notation as Format[j] where j is the bit number. The format bits are organized in N groups of 2 bits. Bits Format[2i] and Format[2i+1] give format information about issue slot i, where 0≦i<N. The meaning of the format bits is explained in the following table:

TABLE I
Format[2i] (lsb)   Format[2i+1] (msb)   Meaning
0                  0                    Issue slot i is used and an operation for it is available in the instruction. The operation size is 26 bits. The size of the extension is 0 bytes.
1                  0                    Issue slot i is used and an operation for it is available in the instruction. The operation size is 34 bits. The size of the extension is 1 byte.
0                  1                    Issue slot i is used and an operation for it is available in the instruction. The operation size is 42 bits. The size of the extension is 2 bytes.
1                  1                    Issue slot i is unused and no operation for it is included in the instruction.

[0138] Operations correspond to issue slots in left to right order. For instance, if 2 issue slots are used, and Format={1, 0, 1, 1, 1, 1, 1, 0, 1, 1}, then the instruction contains two 34-bit operations. The leftmost is routed to issue slot 0 and the rightmost is routed to issue slot 3. If Format={1, 1, 1, 1, 1, 0, 1, 0, 1, 0}, then the instruction contains three 34-bit operations: the leftmost is routed to issue slot 2, the second operation is intended for issue slot 3, and the rightmost belongs to issue slot 4.

[0139] The format used to decompress branch target instructions is a constant. Constant_Format={0, 1, 0, 1, 0, 1, 0, 1, 0, 1} for the preferred five issue slot machine.
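The encoding of Table I and the routing examples above can be expressed as a short decoding sketch. It assumes the format field is available as a list of bits in the order Format[0], Format[1], and so on; the function name and the returned (issue slot, operation size) pairs are illustrative choices, not part of the described hardware.

```python
# (Format[2i], Format[2i+1]) -> operation size in bits, per Table I.
OPERATION_SIZE = {(0, 0): 26, (1, 0): 34, (0, 1): 42}
UNUSED = (1, 1)


def decode_format(format_bits):
    """Return [(issue_slot, size_in_bits), ...] in left to right order."""
    if len(format_bits) % 2 != 0:
        raise ValueError("format field must hold 2 bits per issue slot")
    operations = []
    for slot in range(len(format_bits) // 2):
        pair = (format_bits[2 * slot], format_bits[2 * slot + 1])
        if pair == UNUSED:
            continue  # no operation is stored for this issue slot
        operations.append((slot, OPERATION_SIZE[pair]))
    return operations


if __name__ == "__main__":
    print(decode_format([1, 0, 1, 1, 1, 1, 1, 0, 1, 1]))
    print(decode_format([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]))  # Constant_Format
```

The position of a pair in the returned list is the position of the operation in the compressed instruction, and the first element of the pair is the issue slot to which the decompression unit would route it.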

[0140] 6. Operation Formats

[0141] The format of an operation depends on the following properties:

[0142] zeroary, unary, or binary;

[0143] parametric or non-parametric. Parametric instructions contain an immediate operand in the code. Parameters can be of differing sizes. Here there are param7, i.e. seven bit parameters, and param32, i.e. 32 bit parameters.

[0144] result producing or resultless;

[0145] long or short op code. The short op codes are the 32 most frequent op codes and are five bits long. The long op codes are eight bits long and include all of the op codes, including the ones which can be expressed in a short format. Op codes 0 to 31 are reserved for the 32 short op codes.

[0146] guarded or unguarded. An unguarded instruction has a constant value of the guard of TRUE.

[0147] latency. A format bit indicates if operations have latency equal to one or latency larger than 1.

[0148] signed/unsigned. A format bit indicates for parametric operations if the parameter is signed or unsigned.

[0149] The guarded or unguarded property is determined in the uncompressed instruction format by using the special register file address of the constant 1. If a guard address field contains the address of the constant 1, then the operation is unguarded, otherwise it is guarded. Most operations can occur both in guarded and unguarded formats. An immediate operation, i.e. an operation which transfers a constant to a register, has no guard field and is always unguarded.

[0150] Which op codes are included in the list of 32 short op codes depends on a study of frequency of occurrence which could vary depending on the type of software written.

[0151] Table II below lists operation formats used by the invention. Unless otherwise stated, all formats are: not parametric, with result, guarded, and long op code. To keep the tables and figures as simple as possible the following table does not list a special form for latency and signed/unsigned properties. These are indicated with L and S in the format descriptions. For non-parametric, zeroary operations, the unary format is used. In that case the field for the argument is undefined.

TABLE II
OPERATION TYPE                                SIZE
<binary-unguarded-short>                      26
<unary-param7-unguarded-short>                26
<binary-unguarded-param7-resultless-short>    26
<unary-short>                                 26
<binary-short>                                34
<unary-param7-short>                          34
<binary-param7-resultless-short>              34
<binary-unguarded>                            34
<binary-resultless>                           34
<unary-param7-unguarded>                      34
<unary>                                       34
<binary-param7-resultless>                    42
<binary>                                      42
<unary-param7>                                42
<zeroary-param32>                             42
<zeroary-param32-resultless>                  42

[0152] For all operations a 42-bit format is available for use in branch targets. For unary and binary-resultless operations, the <binary> format can be used. In that case, unused fields in the binary format have undefined values. Short 5-bit op codes are converted to long 8-bit op codes by padding the most significant bits with 0's. Unguarded operations get, as a guard address value, the register file address of constant TRUE. For store operations the 42-bit <binary-param7-resultless> format is used instead of the regular 34-bit <binary-param7-resultless-short> format (assuming store operations belong to the set of short operations).

[0153] Operation types which do not appear in table II are mapped onto those appearing in table II, according to the following table of aliases:

TABLE II (aliases)
FORMAT                                        ALIASED TO
zeroary                                       unary
unary_resultless                              unary
binary_resultless_short                       binary_resultless
zeroary_param32_short                         zeroary_param32
zeroary_param32_resultless_short              zeroary_param32_resultless
zeroary_short                                 unary
unary_resultless_short                        unary
binary_resultless_unguarded                   binary_resultless
unary_unguarded                               unary
binary_param7_resultless_unguarded            binary_param7_resultless
unary_unguarded                               unary
binary_param7_resultless_unguarded            binary_param7_resultless
zeroary_unguarded                             unary
unary_resultless_unguarded_short              binary_unguarded_short
unary_unguarded_short                         unary_short
zeroary_param32_unguarded_short               zeroary_param32
zeroary_param32_resultless_unguarded_short    zeroary_param32_resultless
zeroary_unguarded_short                       unary
unary_resultless_unguarded_short              unary
unary_long                                    binary
binary_long                                   binary
binary_resultless_long                        binary
unary_param7_long                             unary_param7
binary_param7_resultless_long                 binary_param7_resultless
zeroary_param32_long                          zeroary_param32
zeroary_param32_resultless_long               zeroary_param32_resultless
zeroary_long                                  binary
unary_resultless_long                         binary

[0154] The following is a table of fields which appear in operations:

TABLE III
FIELD      SIZE    MEANING
src1       7       register file address of first operand
src2       7       register file address of second operand
guard      7       register file address of guard
dst        7       register file address of result
param      7/32    7 bit parameter or 32 bit immediate value
op code    5/8     5 bit short op code or 8 bit long op code

[0155] FIG. 5 includes a complete specification of the encoding of operations.

[0156] 7. Extensions of the Instruction Format

[0157] Within the instruction format there is some flexibility to add new operations and operation forms, as long as encoding within a maximum size of 42 bits is possible.

[0158] The format is based on 7-bit register file addresses. For register file addresses of different sizes, redesign of the format and decompression hardware is necessary.

[0159] The format can be used on machines with varying numbers of issue slots. However, the maximum size of the instruction is constrained by the word size in the instruction cache. In a 4 issue slot machine the maximum instruction size is 22 bytes (176 bits) using four 42-bit operations plus 8 format bits. In a five issue slot machine, the maximum instruction size is 28 bytes (224 bits) using five 42-bit operations plus 10 format bits.

[0160] In a six issue slot machine, the maximum instruction size would be 264 bits, using six 42-bit operations plus 12 format bits. If the word size is limited to 256 bits, and six issue slots are desired, the scheduler can be constrained to use at most 5 operations of the 42-bit format in one instruction. The fixed format for branch targets would have to use 5 issue slots of 42 bits and one issue slot of 34 bits.

Compressing the Instructions

[0161] FIG. 8 shows a diagram of how source code becomes a loadable, compressed object module. First the source code 801 must be compiled by compiler 802 to create a first set of object modules 803. These modules are linked by linker 804 to create a second type of object module 805. This module is then compressed and shuffled at 806 to yield loadable module 807.

[0162] Any standard compiler or linker can be used. Appendix D gives some background information about the format of object modules in the environment of the invention. Object modules II contain a number of standard data structures. These include: a header; global & local symbol tables; a reference table for relocation information; a section table; and debug information, some of which are used by the compression and shuffling module 806. The object module II also has partitions, including a text partition, where the instructions to be processed reside, and a source partition which keeps track of which source files the text came from.

[0163] A high level flow chart of the compression and shuffling module is shown at FIG. 9. At 901, object module II is read in. At 902 the text partition is processed. At 903 the other sections are processed. At 904 the header is updated. At 905, the object module is output.

[0164] FIG. 10 expands box 902. At 1001, the reference table, i.e. relocation information, is gathered. At 1002, the branch targets are collected, because these are not to be compressed. At 1003, the software checks to see if there are more files in the source partition. If so, at 1004, the portion corresponding to the next file is retrieved. Then, at 1005, that portion is compressed. At 1006, file information in the source partition is updated. At 1007, the local symbol table is updated.

[0165] Once there are no more files in the source partition, the global symbol table is updated at 1008. Then, at 1009, address references in the text section are updated. Then at 1010, 256-bit shuffling is effected. Motivation for such shuffling will be discussed below.

[0166] FIG. 11 expands box 1005. First, it is determined at 1101 whether there are more instructions to be compressed. If so, a next instruction is retrieved at 1102. Subsequently each operation in the instruction is compressed at 1103 as per the tables in FIGS. 5a and 5b and a scatter table is updated at 1108. The scatter table is a new data structure, required as a result of compression and shuffling, which will be explained further below. Then, at 1104, all of the operations in an instruction and the format bits of a subsequent instruction are combined as per FIGS. 4a-4e.

[0167] Subsequently the relocation information in the reference table must be updated at 1105, if the current instruction contains an address.

[0168] At 1106, information needed to update address references in the text section is gathered. At 1107, the compressed instruction is appended at the end of the output bit string and control is returned to box 1101. When there are no more instructions, control returns to box 1006.

[0169] Appendices B and C are source code appendices, in which the functions of the various modules are as listed below:

TABLE IV
Name of module                        identification of function performed
scheme_table                          readable version of table of FIGS. 5a and 5b
comp_shuffle.c                        256-bit shuffle, see box 1010
comp_scheme.c                         boxes 1103-1104
comp_bitstring.c                      boxes 1005 & 1009
comp_main.c                           controls main flow of FIGS. 9 and 10
comp_src.c, comp_reference.c,         miscellaneous support routines for performing
comp_misc.c, comp_btarget.c           other functions listed in

[0170] The scatter table, which is required as a result of the compression and shuffling of the invention, can be explained as follows. The reference table contains a list of locations of addresses used by the instruction stream and a corresponding list of the actual addresses listed at those locations. When the code is compressed, and when it is loaded, those addresses must be updated.

[0171] Accordingly, the reference table is used at these times to allow the updating.

[0172] However, when the code is compressed and shuffled, the actual bits of the addresses are separated from each other and reordered. Therefore, the scatter table lists, for each address in the reference table, where EACH BIT is located. In the preferred embodiment the table lists a width of a bit field, an offset from the corresponding index of the address in the source text, and a corresponding offset from the corresponding index of the address in the destination text.
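The role of a scatter table entry can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the code of Appendices B and C: the names ScatterEntry, scatter_address, and gather_address are hypothetical, bit strings are modeled as lists of 0/1 values, and destination offsets are taken relative to a base index supplied by the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScatterEntry:
    """One bit field of an address (hypothetical layout for illustration)."""
    width: int       # number of contiguous bits in this field
    src_offset: int  # offset of the field within the address in the source text
    dst_offset: int  # offset of the field from the base index in the destination text


def scatter_address(addr_bits: List[int], entries: List[ScatterEntry],
                    dest: List[int], dst_base: int) -> None:
    """Write each bit field of an address to its scattered destination position."""
    for e in entries:
        for b in range(e.width):
            dest[dst_base + e.dst_offset + b] = addr_bits[e.src_offset + b]


def gather_address(entries: List[ScatterEntry], dest: List[int],
                   dst_base: int, n_bits: int) -> List[int]:
    """Collect the scattered bit fields back into a contiguous address."""
    addr_bits = [0] * n_bits
    for e in entries:
        for b in range(e.width):
            addr_bits[e.src_offset + b] = dest[dst_base + e.dst_offset + b]
    return addr_bits
```

With such entries, a loader could rewrite an address in place in the compressed, shuffled text by calling scatter_address with the new address bits, without first reassembling the surrounding instruction.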

[0173] When object module III is loaded to run on the processor, the scatter table allows the addresses listed in the reference table to be updated even before the bits are deshuffled.

Decompressing the Instructions

[0174] In order for the VLIW processor to process the instructions compressed as described above, the instructions must be decompressed. After decompression, the instructions will fill the instruction register, which has N issue slots, N being 5 in the case of the preferred embodiment. FIG. 12 is a schematic of the decompression process. Instructions come from memory 1201, i.e. either from the main memory 104 or the instruction cache 105. The instructions must then be deshuffled 1202, which will be explained further below, before being decompressed 1203. After decompression 1203, the instructions can proceed to the CPU 1204.

[0175] Each decompressed operation has 2 format bits plus a 42 bit operation. The 2 format bits indicate one of the four possible operation lengths (unused issue slot, 26-bit, 34-bit, or 42-bit). These format bits have the same values as “Format” in section 5 above. If an operation has a size of 26 or 34 bits, the upper 16 or 8 bits, respectively, are undefined. If an issue slot is unused, as indicated by the format bits, then all operation bits are undefined and the CPU has to replace the op code by a NOP op code (or otherwise indicate NOP to functional units).

[0176] Formally the decompressed instruction format is

[0177] <decompressed instruction>::={<decompressed operation>}N

[0178] <decompressed operation>::=<operation:42><format:2>

[0179] Operations have the format as in Table III (above).
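The 44-bit decompressed operation can be illustrated with a small Python sketch. Several points are assumptions made only for illustration: the function names are hypothetical, the 2 format bits are taken to occupy the low end of the 44-bit value, and the caller passes the operation size directly because the mapping from format codes to sizes is defined in section 5 and is not restated here.

```python
OP_BITS = 42   # width of the operation field
FMT_BITS = 2   # width of the format field


def split_decompressed_operation(word44: int):
    """Split a 44-bit value into (operation, format).

    Assumption: the format bits sit in the 2 least significant bits.
    """
    if not 0 <= word44 < (1 << (OP_BITS + FMT_BITS)):
        raise ValueError("expected a 44-bit value")
    fmt = word44 & ((1 << FMT_BITS) - 1)
    op = word44 >> FMT_BITS
    return op, fmt


def defined_operation_bits(op: int, size: int):
    """Keep only the defined bits of a 42-bit operation field.

    size is 0 (unused issue slot), 26, 34, or 42. For 26- and 34-bit
    operations the upper 16 or 8 bits are undefined and are masked off.
    An unused slot yields None, standing in for a NOP.
    """
    if size not in (0, 26, 34, 42):
        raise ValueError("size must be 0, 26, 34, or 42")
    if size == 0:
        return None
    return op & ((1 << size) - 1)
```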

[0180] Appendix A is VERILOG code which specifies the functioning of the decompression unit. VERILOG code is a standard format used as input to the VERILOG simulator produced by Cadence Design Systems, Inc. of San Jose, Calif. The code can also be input directly to the design compiler made by Synopsys of Mountain View, Calif. to create circuit diagrams of a decompression unit which will decompress the code. The VERILOG code specifies a list of pins of the decompression unit. These are:

TABLE V
# of pins in group   name of group of pins     description of group of pins
512                  data512                   512 bit input data word from memory, i.e. either from the instruction cache or the main memory
32                   PC                        input program counter
44                   operation4                output contents of issue slot 4
44                   operation3                output contents of issue slot 3
44                   operation2                output contents of issue slot 2
44                   operation1                output contents of issue slot 1
44                   operation0                output contents of issue slot 0
10                   format_out                output duplicate of format bits in operations
32                   first_word                output first 32 bits pointed to by program counter
1                    format_ctr10              is it a branch target or not?
1, each              reissue1, stall_in,       input global pipeline control signals
                     freeze, reset, clk

[0181] Data512 is a double word which contains an instruction which is currently of interest. In the above, the program counter, PC, is used to determine data512 according to the following algorithm:

[0182] A:={PC[31:8],8′b0}

[0183] if PC[5]=0 then

[0184] data512′:={M(A), M(A+32)}

[0185] else data512′:={M(A+32),M(A)}

[0186] where

[0187] A is the address of a single word in memory which contains an instruction of interest;

[0188] 8′b0 means 8 bits which are zeroed out

[0189] M(A) is a word of memory addressed by A;

[0190] M(A+32) is a word of memory addressed by A+32;

[0191] data512′ is the shuffled version of data512
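The selection algorithm above can be sketched in Python as follows. This follows the algorithm as printed, with these assumptions: the function name fetch_data512_shuffled and the read_word callback are hypothetical, memory words are 32 bytes (256 bits) so that A+32 addresses the next word, and the returned pair simply mirrors the order of the two entries inside the braces.

```python
WORD_BYTES = 32  # one 256-bit memory word


def fetch_data512_shuffled(pc: int, read_word):
    """Return the two memory words forming data512', in the order given
    by the braces in the algorithm.

    read_word(addr) is a caller-supplied function returning the word M(addr).
    """
    a = pc & 0xFFFFFF00                 # A := {PC[31:8], 8'b0}
    m_a = read_word(a)                  # M(A)
    m_a32 = read_word(a + WORD_BYTES)   # M(A+32)
    if (pc >> 5) & 1 == 0:              # PC[5] = 0
        return (m_a, m_a32)
    return (m_a32, m_a)                 # PC[5] = 1: the two words are swapped
```

Passing a function that reads from a dictionary or byte array as read_word is enough to exercise the swap on PC[5].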

[0192] This means that words are swapped if an odd word is addressed. Operations are delivered by the decompression unit in a form which is only partially decompressed, because the operation fields are not always in the same bit position. Some further processing has to be done to extract the operation fields from their bit position, most of which can be done best in the instruction decode stage of the CPU pipeline. For every operation field this is explained as follows:

[0193] src1

[0194] The src1 field is in a fixed position and can be passed directly to the register file as an address. Only the 32-bit immediate operation does not use the src1 field. In this case the CPU control will not use the src1 operand from the register file.

[0195] src2

[0196] The src2 field is in a fixed position if it is used and can be passed directly to the register file as an address. If it is not used it has an undefined value. The CPU control makes sure that a “dummy” src2 value read from the register file is not used.

[0197] Guard

[0198] The guard field is in a fixed position if it is used and can be passed directly to the register file as an address. Simultaneously with register file access, the CPU control inspects the op code and format bits of the operation. If the operation is unguarded, the guard value read from the RF (register file) is replaced by the constant TRUE.

[0199] op Code

[0200] Short or long op code and format bits are available in a fixed position in the operation. They are in bit positions 21-30 plus the 2 format bits. They can be fed directly to the op code decode with maximum time for decoding.

[0201] dst

[0202] The dst field is needed very quickly in case of a 32-bit immediate operation with latency 0. This special case is detected quickly by the CPU control by inspecting bit 33 and the format bits. In all other cases there is a full clock cycle available in the instruction decode pipeline stage to decode where the dst field is in the operation (it can be in many places) and extract it.

[0203] 32-Bit Immediate

[0204] If there is a 32-bit immediate it is in a fixed position in the operation. The 7 least significant bits are in the src2 field in the same location as a 7-bit parameter would be.

[0205] 7-Bit Parameter

[0206] If there is a 7-bit parameter it is in the src2 field of the operation. There is one exception: the store with offset operation. For this operation, the 7-bit parameter can be in various locations and is multiplexed onto a special 7-bit immediate bus to the data cache.

Bit Swizzling

[0207] Where instructions are long, e.g. 512 bit double words, cache structure becomes complex. It is advantageous to swizzle the bits of the instructions in order to simplify the layout of the chip. Herein, the words swizzle and shuffle are used to mean the same thing. The following is an algorithm for swizzling bits, see also comp_shuffle.c in the source code appendix.

for (k=0; k<4; k=k+1)
  for (i=0; i<8; i=i+1)
    for (j=0; j<8; j=j+1)
    begin
      word_shuffled[k*64+j*8+i] =
        word_unshuffled[(4*i+k)*8 + j]
    end

[0208] where i, j, and k are integer indices; word_shuffled is a matrix for storing bits of a shuffled word; and word_unshuffled is a matrix for storing bits of an unshuffled word.
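The swizzling loop can be transcribed into Python to check that it is a permutation of 256 bit positions and to derive its inverse. The sketch below assumes a word is modeled as a list of 256 entries; the function names shuffle_256 and unshuffle_256 are hypothetical and are not taken from comp_shuffle.c.

```python
def shuffle_256(word_unshuffled):
    """Apply the swizzle: shuffled[k*64 + j*8 + i] = unshuffled[(4*i + k)*8 + j]."""
    if len(word_unshuffled) != 256:
        raise ValueError("expected 256 bits")
    word_shuffled = [0] * 256
    for k in range(4):
        for i in range(8):
            for j in range(8):
                word_shuffled[k * 64 + j * 8 + i] = word_unshuffled[(4 * i + k) * 8 + j]
    return word_shuffled


def unshuffle_256(word_shuffled):
    """Invert the swizzle by running the same index mapping in reverse."""
    if len(word_shuffled) != 256:
        raise ValueError("expected 256 bits")
    word_unshuffled = [0] * 256
    for k in range(4):
        for i in range(8):
            for j in range(8):
                word_unshuffled[(4 * i + k) * 8 + j] = word_shuffled[k * 64 + j * 8 + i]
    return word_unshuffled
```

Feeding the list 0..255 through shuffle_256 shows where each source bit lands: for example, shuffled position 1 takes source bit 32, and shuffled position 64 takes source bit 8.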

Cache Structure

[0209] FIG. 6a shows the functioning on input of a cache structure which is useful in efficient processing of VLIW instructions. This cache includes 16 banks 601-616 of 2 k bytes each. These banks share an input bus 617. The banks are divided into two stacks. The stack on the left will be referred to as “low” and the stack on the right will be referred to as “high”.

[0210] The cache can take input in only one bank at a time and then only 4 bytes at a time. Addressing determines which 4 bytes of which bank are being filled. For each 512 bit word to be stored in the cache, 4 bytes are stored in each bank. A shaded portion of each bank is illustrated indicating corresponding portions of each bank for loading of a given word. These shaded portions are for illustration only. Any given word can be loaded into any set of corresponding portions of the banks.

[0211] After swizzling according to the algorithm indicated above, sequential 4 byte portions of the swizzled word are loaded into the banks in the following order: 608, 616, 606, 614, 604, 612, 602, 610, 607, 615, 605, 613, 603, 611, 601, 609. The order of loading of the 4 byte sections of the swizzled word is indicated by roman numerals in the boxes representing the banks.
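The loading order can be written down as a lookup table to make its pattern visible. In this Python sketch the names LOAD_ORDER, bank_for_portion, and stack_of are hypothetical, and banks 601-608 are taken to form the low stack and banks 609-616 the high stack, consistent with the description of FIG. 6b.

```python
# Bank receiving each sequential 4-byte portion of the swizzled 512-bit word.
LOAD_ORDER = [608, 616, 606, 614, 604, 612, 602, 610,
              607, 615, 605, 613, 603, 611, 601, 609]


def bank_for_portion(portion_index: int) -> int:
    """Return the bank number that stores 4-byte portion 0..15."""
    return LOAD_ORDER[portion_index]


def stack_of(bank: int) -> str:
    """Classify a bank as belonging to the low or high stack (assumed split)."""
    if not 601 <= bank <= 616:
        raise ValueError("unknown bank")
    return "low" if bank <= 608 else "high"
```

Under that assumption, even-numbered portions go to the low stack and odd-numbered portions to the high stack, and every bank receives exactly one portion of each word.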

[0212] FIG. 6b shows how the swizzled word is read out from the cache.

[0213] FIG. 6b shows only the shaded portions of the banks of the low stack. The high portion is analogous. Each shaded portion 601a-608a has 32 bits. The bits are loaded onto the output bus, called bus256low, using the connections shown, i.e. in the following order: 608a-bit 0, 607a-bit 0, . . . , 601a-bit 0; 608a-bit 1, 607a-bit 1, . . . , 601a-bit 1; . . . ; 608a-bit 31, 607a-bit 31, . . . , 601a-bit 31. Using these connections, the word is automatically de-swizzled back to its proper bit order.

[0214] The bundles of wires, 620, 621, . . . , 622 together form the output bus256low. These wires pass through the cache to the output without crossing.

[0215] On output, the cache looks like FIG. 7. The bits are read out from stack low 701 and stack high 702 under control of control unit 704 through a shift network 703 which assures that the bits are in the output order specified above. In this way the entire output of the 512 bit word is assured without bundles 620, 621, . . . 622 and analogous wires crossing.
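The read-out order onto bus256low is a bit-level interleave of the eight low-stack banks. The Python sketch below only enumerates that order as (bank, bit) pairs; the function name bus256low_order is hypothetical, and the sketch does not model the wiring itself.

```python
def bus256low_order():
    """List (bank, bit) pairs in the order they appear on bus256low:
    bit 0 of banks 608 down to 601, then bit 1 of the same banks, and so on."""
    banks = range(608, 600, -1)  # 608, 607, ..., 601
    return [(bank, bit) for bit in range(32) for bank in banks]
```

Position n on the bus therefore carries bit n // 8 of bank 608 - (n % 8), which is why each bundle of wires can run straight through the stack.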

We claim:
 1. A computer storage medium comprising a stored VLIW instruction, which instruction comprises a plurality of operations for commencing execution in a same machine cycle on distinct functional units of a VLIW processor, each non-null operation being compressed according to a compression scheme which assigns a compressed operation length to that operation, the compressed operation length being chosen from a plurality of finite lengths, which finite lengths include at least two non-zero lengths, which of the finite lengths is chosen being dependent upon at least one feature of the operation.
 2. The medium of claim 1 wherein the set of operation lengths is {0, 26, 34, 42}.
 3. The medium of claim 1 wherein the at least one feature is at least one of the following: abbreviated op code; guarded or unguarded; resultless; immediate parameter with fixed number of bits; and zeroary, unary, or binary.
 4. The medium of claim 3 wherein combined operation types are aliased according to the following table

FORMAT                                        ALIASED TO
zeroary                                       unary
unary_resultless                              unary
binary_resultless_short                       binary_resultless
zeroary_param32_short                         zeroary_param32
zeroary_param32_resultless_short              zeroary_param32_resultless
zeroary_short                                 unary
unary_resultless_short                        unary
binary_resultless_unguarded                   binary_resultless
unary_unguarded                               unary
binary_param7_resultless_unguarded            binary_param7_resultless
unary_unguarded                               unary
binary_param7_resultless_unguarded            binary_param7_resultless
zeroary_unguarded                             unary
unary_resultless_unguarded_short              binary_unguarded_short
unary_unguarded_short                         unary_short
zeroary_param32_unguarded_short               zeroary_param32
zeroary_param32_resultless_unguarded_short    zeroary_param32_resultless
zeroary_unguarded_short                       unary
unary_resultless_unguarded_short              unary
unary_long                                    binary
binary_long                                   binary
binary_resultless_long                        binary
unary_param7_long                             unary_param7
binary_param7_resultless_long                 binary_param7_resultless
zeroary_param32_long                          zeroary_param32
zeroary_param32_resultless_long               zeroary_param32_resultless
zeroary_long                                  binary
unary_resultless_long                         binary


 5. The medium of claim 3, wherein the fixed number is one of 7 and 32.
 6. The medium of claim 1 comprising a plurality of such instructions, of which one instruction is a branch target, which one instruction is not compressed.
 7. The medium of claim 1 wherein each operation field within each instruction includes a sub-field specifying at least one of the following: a register file address of a first operand; a register file address of a second operand; a register file address of guard information; a register file address of a result; an immediate parameter; and an op code.
 8. The medium of claim 1 comprising a plurality of such instructions, each instruction comprising a format field for specifying a plurality of respective formats, one respective format for each operation of a succeeding instruction.
 9. The medium of claim 8, wherein the compressed format comprises a format field specifying issue slots of the VLIW processor to be used by some instruction.
 10. The medium of claim 9 comprising at least one field specifying the operation.
 11. The medium of claim 9 wherein the format field has 2*N bits, where N is the number of issue slots.
 12. The medium of claim 10 wherein the at least one field specifying the operation comprises at least one byte aligned sub-field.
 13. The medium of claim 10 further comprising at least one operation part sub-field located in a same byte with the format field.
 14. The medium of claim 9 wherein the instruction takes up no more than 32 bytes.
 15. The medium of claim 1 comprising a plurality of such instructions, wherein at least two of the instructions have different lengths.
 16. The medium of claim 1 wherein the instruction is aligned with a byte boundary, but not a word boundary.
 17. The medium of claim 13 wherein the format field specifies that more than a threshold quantity of issue slots are to be used and further comprising at least one first operation part sub-field located in a same byte with the format field, a plurality of sub-fields specifying operations, and at least one second operation part sub-field located in a byte separate from the other sub-fields.
 18. The medium of claim 9 formatted as follows

<instruction>::= <instruction start> <instruction middle> <instruction end> <instruction extension>
<instruction start>::= <Format:2*N>{<padding:1>}V2{<2-bit operation part:2>}V1{<24-bit operation part:24>}V1
<instruction middle>::={{<2-bit operation part:2>}4 {<24-bit operation part:24>}4}V3
<instruction end>::={<padding:1>}V5{<2-bit operation part:2>}V4 {<24-bit operation part:24>}V4
<instruction extension>::={<operation extension:0/8/16>}S
<padding>::=“0”

Wherein the variables used above are defined as follows:
N=the number of issue slots of the machine, N>0
S=the number of issue slots used in this instruction (0≦S≦N)
C1=4−(N mod 4)
If (S≦C1) then V1=S and V2=2*(C1−V1)
If (S>C1) then V1=C1 and V2=0
V3=(S−V1) div 4
V4=(S−V1) mod 4
If (V4>0) then V5=2*(4−V4) else V5=0

Explanation of notation
::= means “is defined as”
<field name:number> means the field indicated before the colon has the number of bits indicated after the colon.
{<field name>}number means the field indicated in the angle brackets and braces is repeated the number of times indicated after the braces
“0” means the character “0”
“div” means integer divide
“mod” means modulo
:0/8/16 means that the field is 0, 8, or 16 bits long.
 19. The medium of claim 9 containing an operation which is encoded in 26, 34 or 42 bits, wherein
if the operation is 26 bits, it is one of binary unguarded short; unary immediate 7-bit parameter unguarded operation; binary unguarded immediate 7-bit operand resultless short; and unary short;
if the operation is 34 bits, it is one of binary short; unary immediate 7-bit parameter resultless short; binary unguarded; unary immediate 7-bit parameter unguarded; and unary; and
if the operation is encoded in 42 bits, it is one of binary immediate 7-bit parameter resultless; binary; unary immediate 7-bit parameter; zeroary immediate 32-bit parameter; and zeroary, immediate 32-bit parameter resultless.
 20. The medium of claim 9 wherein the operations are encoded according to the following table, in which bit positions 0-23 form the 24-bit operation part, bit positions 24-25 form the 2-bit part, and bit positions 26-41 form the extension:

name | 0-6 | 7-13 | 14-20 | 21-23 | 24-25 | extension | size
26-format:
<binary-unguarded-short> | src1[0:6] | src2[0:6] | dst[0:6] | opcode[0:2] | opcode[3:4] | | 26
<unary-param7-unguarded-short> | src1[0:6] | param[0:6] | dst[0:6] | opcode[0:2] | opcode[3:4] | | 26
<binary-unguarded-param7-resultless-short> | src1[0:6] | src2[0:6] | param[0:6] | opcode[0:2] | opcode[3:4] | | 26
<unary-short> | src1[0:6] | dst[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | | 26
34-format:
<binary-short> | src1[0:6] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | dst[0:6] 0 | 34
<unary-param7-short> | src1[0:6] | param[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | dst[0:6] 0 | 34
<binary-param7-resultless-short> | src1[0:6] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | param[0:6] 0 | 34
<binary-unguarded> | src1[0:6] | src2[0:6] | dst[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7] XL011 | 34
<binary-resultless> | src1[0:6] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7] X1001 | 34
<unary-param7-unguarded> | src1[0:6] | param[0:6] | dst[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7] SL111 | 34
<unary> | src1[0:6] | dst[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7] XL101 | 34
42-format:
<binary-param7-resultless> | src1[0:6] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7] SXX100 param[0:6] | 42
<binary> | src1[0:6] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7] XL0101 dst[0:6] | 42
<unary-param7> | src1[0:6] | param[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7] SL1101 dst[0:6] | 42
<zeroary-param32> | param[7:13] | param[0:6] | dst[0:6] | param[14:16] | param[17:18] | param[19:23] XX1 param[24:31] | 42
<zeroary-param32-resultless> | param[7:13] | param[0:6] | guard[0:6] | param[14:16] | param[17:18] | param[19:23] 000 param[24:31] | 42
<zeroary-param32-resultless> | param[7:13] | param[0:6] | guard[0:6] | param[14:16] | param[17:18] | param[19:23] 100 param[24:31] | 42


 21. A computer storage medium comprising a stream of stored instructions for execution on a VLIW processor, the stream of instructions comprising: at least one first instruction which is a branch target and which is uncompressed; and at least one second instruction following the branch target which is compressed according to a scheme where formats are assigned to instructions according to features of the instructions.
 22. The medium of claim 21 wherein the at least one first instruction is stored aligned with a word boundary.
 23. The medium of claim 21 wherein the at least one second instruction is stored unaligned with a word boundary.
 24. The medium of claim 22 wherein at least one of the at least one first instruction or the at least one second instruction specifies a plurality of operations for beginning in a same machine cycle.
 25. A computer storage medium comprising a stream of stored instructions, the stream of stored instructions including a first instruction including a format field which specifies an instruction compression format; and a second instruction, following the first instruction, that is compressed according to the format field in the first instruction.